

From Image Captioning to Visual Storytelling

Passadakis, Admitos, Song, Yingjin, Gatt, Albert

arXiv.org Artificial Intelligence

Visual Storytelling is a challenging multimodal task at the intersection of Vision & Language, where the purpose is to generate a story for a stream of images. Its difficulty lies in the fact that the story should be grounded in the image sequence while also being narrative and coherent. The aim of this work is to balance these aspects by treating Visual Storytelling as a superset of Image Captioning, an approach quite different from most prior relevant studies. This means that we first employ a vision-to-language model to obtain captions of the input images, and then transform these captions into coherent narratives using language-to-language methods. Our multifarious evaluation shows that integrating captioning and storytelling under a unified framework has a positive impact on the quality of the produced stories. In addition, compared to numerous previous studies, this approach accelerates training time and makes our framework readily reusable and reproducible by anyone interested. Lastly, we propose a new metric/tool, named ideality, that can be used to simulate how far some results are from an oracle model, and we apply it to emulate human-likeness in visual storytelling.
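The two-stage pipeline described in this abstract can be sketched in a few lines. Note that the two functions below are hypothetical stubs standing in for real models: a vision-to-language captioner and a language-to-language rewriter; only the pipeline structure, not the model behavior, reflects the paper.

```python
# Minimal sketch of the caption-then-narrate pipeline, with stub models.

def caption_image(image_id: str) -> str:
    """Stand-in for a vision-to-language captioning model."""
    # A real system would run the image through a trained captioner.
    return f"a photo showing scene {image_id}"

def captions_to_story(captions: list[str]) -> str:
    """Stand-in for a language-to-language model that rewrites
    independent captions into one coherent narrative."""
    # A real system would condition a text-to-text model on all captions.
    return " ".join(c.capitalize() + "." for c in captions)

def tell_story(image_sequence: list[str]) -> str:
    # Stage 1: ground each image with a caption.
    captions = [caption_image(img) for img in image_sequence]
    # Stage 2: transform the caption sequence into a narrative.
    return captions_to_story(captions)

print(tell_story(["img1", "img2", "img3"]))
```

The point of the design is the clean separation: each stage can be swapped for a stronger model without retraining the other, which is what makes the framework reusable.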


VIST-GPT: Ushering in the Era of Visual Storytelling with LLMs?

Gado, Mohamed, Taliee, Towhid, Memon, Muhammad, Ignatov, Dmitry, Timofte, Radu

arXiv.org Artificial Intelligence

Visual storytelling is an interdisciplinary field combining computer vision and natural language processing to generate cohesive narratives from sequences of images. This paper presents a novel approach that leverages recent advancements in multimodal models, specifically adapting transformer-based architectures and large multimodal models, for the visual storytelling task. Leveraging the large-scale Visual Storytelling (VIST) dataset, our VIST-GPT model produces visually grounded, contextually appropriate narratives. We address the limitations of traditional evaluation metrics, such as BLEU, METEOR, ROUGE, and CIDEr, which are not suitable for this task. Instead, we utilize RoViST and GROOViST, novel reference-free metrics designed to assess visual storytelling, focusing on visual grounding, coherence, and non-redundancy. These metrics provide a more nuanced evaluation of narrative quality, aligning closely with human judgment.


Generating Visual Stories with Grounded and Coreferent Characters

Liu, Danyang, Lapata, Mirella, Keller, Frank

arXiv.org Artificial Intelligence

Characters are important in narratives. They move the plot forward, create emotional connections, and embody the story's themes. Visual storytelling methods focus more on the plot and events relating to it, without building the narrative around specific characters. As a result, the generated stories feel generic, with character mentions being absent, vague, or incorrect. To mitigate these issues, we introduce the new task of character-centric story generation and present the first model capable of predicting visual stories with consistently grounded and coreferent character mentions. Our model is finetuned on a new dataset which we build on top of the widely used VIST benchmark. Specifically, we develop an automated pipeline to enrich VIST with visual and textual character coreference chains. We also propose new evaluation metrics to measure the richness of characters and coreference in stories. Experimental results show that our model generates stories with recurring characters which are consistent and coreferent to a larger extent compared to baselines and state-of-the-art systems.


Not (yet) the whole story: Evaluating Visual Storytelling Requires More than Measuring Coherence, Grounding, and Repetition

Surikuchi, Aditya K, Fernández, Raquel, Pezzelle, Sandro

arXiv.org Artificial Intelligence

For both human speakers and machine learning models, the task requires connecting the visual data causally, to generate a narrative consistent with the contents of the images. As for model-generated stories, evaluation is one of the key challenges due to the inherently creative nature of the task. Since human-written stories are typically used to train visual storytelling models--under the assumption that these stories provide a good learning signal--most previous work evaluated model-generated stories by directly comparing them against these human-written references. First, we evaluate several models, which we test in a zero-shot manner. We show that LLaVA (Liu et al., 2024), a powerful foundation model, performs best on the task, but only slightly better than TAPM (Yu et al., 2021), a model designed for visual storytelling which is 50 times smaller than LLaVA. Second, given insights derived from our proposed distance-based evaluation method, we upgrade the visual and language components of TAPM, resulting in a model that achieves comparable performance to LLaVA with a significantly lower number of parameters.


SCO-VIST: Social Interaction Commonsense Knowledge-based Visual Storytelling

Wang, Eileen, Han, Soyeon Caren, Poon, Josiah

arXiv.org Artificial Intelligence

Visual storytelling aims to automatically generate a coherent story based on a given image sequence. Unlike image captioning, visual stories should contain factual descriptions, worldviews, and human social commonsense to put disjointed elements together into a coherent and engaging human-writeable story. However, most models mainly focus on applying factual information and using taxonomic/lexical external knowledge when attempting to create stories. This paper introduces SCO-VIST, a framework representing the image sequence as a graph with objects and relations that includes human action motivation and its social interaction commonsense knowledge. SCO-VIST then takes this graph representing plot points and creates bridges between plot points with semantic and occurrence-based edge weights. This weighted story graph produces the storyline as a sequence of events using the Floyd-Warshall algorithm. Our proposed framework produces stories that are superior across multiple metrics in terms of visual grounding, coherence, diversity, and humanness, per both automatic and human evaluations.
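The storyline-selection step named in the abstract can be illustrated concretely: all-pairs shortest paths over the weighted plot-point graph, with the chosen path read off as the event sequence. The plot points and edge weights below are invented for illustration; the paper derives them from semantic and occurrence statistics.

```python
# Toy storyline extraction with the Floyd-Warshall algorithm.
INF = float("inf")

def floyd_warshall(n, edges):
    """edges: dict (u, v) -> weight. Returns distance and next-hop tables."""
    dist = [[0 if i == j else INF for j in range(n)] for i in range(n)]
    nxt = [[j if i == j else None for j in range(n)] for i in range(n)]
    for (u, v), w in edges.items():
        dist[u][v] = w
        nxt[u][v] = v
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
                    nxt[i][j] = nxt[i][k]
    return dist, nxt

def storyline(nxt, start, goal):
    """Reconstruct the event sequence from start to goal plot point."""
    if nxt[start][goal] is None:
        return []
    path = [start]
    while path[-1] != goal:
        path.append(nxt[path[-1]][goal])
    return path

# Hypothetical plot points 0..3; a lower weight means a stronger bridge.
edges = {(0, 1): 1.0, (1, 2): 1.0, (0, 2): 3.0, (2, 3): 1.0, (1, 3): 5.0}
dist, nxt = floyd_warshall(4, edges)
print(storyline(nxt, 0, 3))  # → [0, 1, 2, 3]
```

Here the cheapest route through the graph visits every intermediate plot point, so the storyline passes through events 1 and 2 rather than jumping straight from 0 to 3.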


DiffuVST: Narrating Fictional Scenes with Global-History-Guided Denoising Models

Wu, Shengguang, Yuan, Mei, Su, Qi

arXiv.org Artificial Intelligence

Recent advances in image and video creation, especially AI-based image synthesis, have led to the production of numerous visual scenes that exhibit a high level of abstractness and diversity. Consequently, Visual Storytelling (VST), a task that involves generating meaningful and coherent narratives from a collection of images, has become even more challenging and is increasingly desired beyond real-world imagery. While existing VST techniques, which typically use autoregressive decoders, have made significant progress, they suffer from low inference speed and are not well-suited for synthetic scenes. To this end, we propose a novel diffusion-based system DiffuVST, which models the generation of a series of visual descriptions as a single conditional denoising process. The stochastic and non-autoregressive nature of DiffuVST at inference time allows it to generate highly diverse narratives more efficiently. In addition, DiffuVST features a unique design with bi-directional text history guidance and multimodal adapter modules, which effectively improve inter-sentence coherence and image-to-text fidelity. Extensive experiments on the story generation task covering four fictional visual-story datasets demonstrate the superiority of DiffuVST over traditional autoregressive models in terms of both text quality and inference speed.
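The speed argument above rests on non-autoregressive decoding: all positions are refined in parallel for a fixed number of steps, instead of token by token. The toy below illustrates only that control flow; the "denoiser" is a stand-in that pulls each vector toward a condition-dependent target, whereas a real diffusion model predicts noise with a trained network, and the scalar "sentence embeddings" and targets are invented.

```python
# Toy non-autoregressive denoising loop in the spirit of diffusion decoders.
import random

def denoise_step(xs, targets, rate=0.5):
    # One parallel update of ALL positions (non-autoregressive).
    return [x + rate * (t - x) for x, t in zip(xs, targets)]

def generate(targets, steps=20, seed=0):
    rng = random.Random(seed)
    xs = [rng.gauss(0, 1) for _ in targets]  # start from pure noise
    for _ in range(steps):                   # fixed cost, independent of length
        xs = denoise_step(xs, targets)
    return xs

targets = [0.1, 0.4, -0.3, 0.8, 0.2]  # hypothetical per-image conditions
out = generate(targets)
print([round(x, 3) for x in out])
```

Because the loop length is fixed, generating five descriptions costs the same number of model calls as generating fifty, which is the efficiency property the abstract claims over autoregressive decoders.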


GROOViST: A Metric for Grounding Objects in Visual Storytelling

Surikuchi, Aditya K, Pezzelle, Sandro, Fernández, Raquel

arXiv.org Artificial Intelligence

A proper evaluation of stories generated for a sequence of images -- the task commonly referred to as visual storytelling -- must consider multiple aspects, such as coherence, grammatical correctness, and visual grounding. In this work, we focus on evaluating the degree of grounding, that is, the extent to which a story is about the entities shown in the images. We analyze current metrics, both designed for this purpose and for general vision-text alignment. Given their observed shortcomings, we propose a novel evaluation tool, GROOViST, that accounts for cross-modal dependencies, temporal misalignments (the fact that the order in which entities appear in the story and the image sequence may not match), and human intuitions on visual grounding. An additional advantage of GROOViST is its modular design, where the contribution of each component can be assessed and interpreted individually.
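A deliberately simplified sketch of such a modular grounding score: each noun phrase gets its best alignment score against any image in the sequence (tolerating temporal misalignment), and nouns below a threshold contribute negatively. The alignment scores below are invented; GROOViST itself computes them with a vision-language model, and this is only an illustration of the scoring structure, not the published metric.

```python
# Toy modular grounding score: per-noun contributions over an image sequence.

def grounding_score(noun_image_scores, threshold=0.5):
    """noun_image_scores: {noun: [alignment score per image]}.
    Each noun contributes (best score - threshold), so poorly
    grounded nouns pull the story's overall score down."""
    contributions = []
    for noun, scores in noun_image_scores.items():
        best = max(scores)  # best match anywhere: tolerates misalignment
        contributions.append(best - threshold)
    return sum(contributions) / len(contributions)

story_nouns = {
    "dog":    [0.9, 0.2, 0.1],   # clearly visible in the first image
    "beach":  [0.1, 0.8, 0.7],   # appears later than mentioned: still fine
    "dragon": [0.05, 0.1, 0.0],  # hallucinated entity: penalized
}
print(round(grounding_score(story_nouns), 2))  # → 0.1
```

Because each noun's contribution is computed independently, the score can be inspected per entity, mirroring the interpretability advantage the abstract attributes to a modular design.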


Visual Storytelling with Question-Answer Plans

Liu, Danyang, Lapata, Mirella, Keller, Frank

arXiv.org Artificial Intelligence

Visual storytelling aims to generate compelling narratives from image sequences. Existing models often focus on enhancing the representation of the image sequence, e.g., with external knowledge sources or advanced graph structures. Despite recent progress, the stories are often repetitive, illogical, and lacking in detail. To mitigate these issues, we present a novel framework which integrates visual representations with pretrained language models and planning. Our model translates the image sequence into a visual prefix, a sequence of continuous embeddings which language models can interpret. It also leverages a sequence of question-answer pairs as a blueprint plan for selecting salient visual concepts and determining how they should be assembled into a narrative. Automatic and human evaluation on the VIST benchmark (Huang et al., 2016) demonstrates that blueprint-based models generate stories that are more coherent, interesting, and natural compared to competitive baselines and state-of-the-art systems.
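The blueprint idea separates planning from realization: a question-answer plan fixes the salient concepts before any sentence is written, and a realizer then verbalizes one sentence per plan item. Both components below are invented template stand-ins for the paper's learned modules; only the plan-then-realize structure is from the abstract.

```python
# Sketch of a question-answer blueprint plan followed by sentence realization.

def make_blueprint(concepts_per_image):
    """Turn per-image concept lists into a QA plan (stand-in planner)."""
    plan = []
    for i, concepts in enumerate(concepts_per_image):
        plan.append((f"What happens in image {i + 1}?", ", ".join(concepts)))
    return plan

def realize(plan):
    """Verbalize one sentence per question-answer pair (stand-in realizer)."""
    return " ".join(f"Then we see {answer}." for _, answer in plan)

plan = make_blueprint([["a family", "a picnic"], ["children", "a kite"]])
print(realize(plan))
```

The benefit of the intermediate plan is that content selection becomes inspectable and editable before generation, which is how blueprint models curb repetition and omission.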


Envisioning Narrative Intelligence: A Creative Visual Storytelling Anthology

Halperin, Brett A., Lukin, Stephanie M.

arXiv.org Artificial Intelligence

Visual imagery and language have long since complemented each other in visual storytelling. From children's picture books to comics and news articles, this multimedia nexus forms a complementary interplay between imagery and spoken or written language. While audiences often experience stories and pictures together, visual images alone can also operate as starting points--sources of creative inspiration--for authors to write stories [42]. Researchers have found that visual thinking [5, 6] and drawing [3] can prompt storytelling from a multitude of perspectives as long as creativity is not disturbed in the process [17]. This affirms how creative writing and imagery are interconnected such that stories can be derived from images to culminate in creative visual storytelling. In this paper, we collect an anthology of 100 visual stories from authors who participated in our systematic creative process of improvised story-building based on image sequences. Following close reading and thematic analysis of our anthology, we present five themes that characterize the variations found in this creative visual storytelling process: (1) Narrating What is in Vision vs. Envisioning; (2) Dynamically Characterizing Entities/Objects; (3) Sensing Experiential Information About the Scenery; (4) Modulating the Mood; (5) Encoding Narrative Biases. In understanding the varied ways that people derive stories from images, we offer considerations for collecting story-driven training data to inform automatic visual story generation. In correspondence with each theme, we envision


Vision Transformer Based Model for Describing a Set of Images as a Story

Malakan, Zainy M., Hassan, Ghulam Mubashar, Mian, Ajmal

arXiv.org Artificial Intelligence

Visual Story-Telling is the process of forming a multi-sentence story from a set of images. Appropriately including visual variation and contextual information captured inside the input images is one of the most challenging aspects of visual storytelling. Consequently, stories developed from a set of images often lack cohesiveness, relevance, and semantic relationship. In this paper, we propose a novel Vision Transformer based model for describing a set of images as a story. The proposed method extracts the distinct features of the input images using a Vision Transformer (ViT). First, input images are divided into 16×16 patches and bundled into a linear projection of flattened patches. The transformation from a single image to multiple image patches captures the visual variety of the input visual patterns. These features are used as input to a Bidirectional-LSTM which is part of the sequence encoder. This captures the past and future image context of all image patches. Then, an attention mechanism is implemented and used to increase the discriminatory capacity of the data fed into the language model, i.e. a Mogrifier-LSTM. The performance of our proposed model is evaluated using the Visual Story-Telling dataset (VIST), and the results show that our model outperforms current state-of-the-art models.
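The patch-embedding step the abstract describes can be shown in a few lines of NumPy: cut the image into non-overlapping 16×16 patches, flatten each, and apply a linear projection. The image, the random projection matrix, and the embedding size of 512 are all illustrative assumptions; in a trained ViT the projection is learned.

```python
# Minimal sketch of ViT-style patch extraction and linear projection.
import numpy as np

def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """(H, W, C) image -> (num_patches, patch*patch*C) flattened patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    # Split both spatial axes into a grid of patches, then flatten each.
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, c)
    return grid.reshape(-1, patch * patch * c)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))
patches = patchify(image)       # (196, 768): 14x14 patches of 16*16*3 values
W = rng.random((768, 512))      # hypothetical learned projection, dim 512
tokens = patches @ W            # (196, 512): the sequence fed downstream
print(patches.shape, tokens.shape)
```

Each of the 196 rows then plays the role of one "token" for the sequence encoder, which is how a single image becomes the multi-element input the abstract refers to.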